Abstract: We consider the problem of nonlinear dimensionality
reduction: given a training set of high-dimensional data
whose "intrinsic" low dimension is assumed known, find a feature
extraction map to low-dimensional space, a reconstruction map
back to high-dimensional space, and a geometric description of
the dimension-reduced data as a smooth manifold. We introduce
a complexity-regularized quantization approach for fitting a
Gaussian mixture model to the training set via a Lloyd algorithm.
Complexity regularization controls the trade-off between
adaptation to the local shape of the underlying manifold and
global geometric consistency. The resulting mixture model is used
to design the feature extraction and reconstruction maps and to
define a Riemannian metric on the low-dimensional data. We also
sketch a proof of consistency of our scheme for the purposes of
estimating the unknown underlying pdf of high-dimensional data.

I. Introduction 

When dealing with high volumes of vector-valued data of
some large dimension $n$, it is often assumed that the data
possess some intrinsic geometric description in a space of
unknown dimension $k < n$ and that the high dimensionality
arises from an unknown stochastic mapping of $\mathbb{R}^k$ into
$\mathbb{R}^n$. We can pose the problem of nonlinear dimensionality
reduction (NLDR) [1], [2] as follows: given raw data with values
in $\mathbb{R}^n$, we wish to obtain optimal estimates of the intrinsic
dimension $k$ and of the stochastic map with the purpose of
modeling the intrinsic geometry of the data in $\mathbb{R}^k$.

One typically considers the following set-up: we are given
a sample $X^N = (X_1, \ldots, X_N)$, where the $X_i$ are i.i.d. according
to an unknown absolutely continuous distribution $P^*$. The
corresponding pdf $f^*$ has to be estimated from the observation
as $\hat{f}_N = \hat{f}_N(X^N)$. The intrinsic dimension $k$ of the data may
not be known in advance and would also have to be estimated as
$\hat{k}_N = \hat{k}_N(X^N)$. Since the pdf $f^*$ is assumed to arise from a
stochastic map of the low-dimensional space $\mathbb{R}^k$ into the
high-dimensional space $\mathbb{R}^n$, we can use our knowledge about $k$ and
$f^*$ in order to make inferences about the intrinsic geometry of
the data. In the absence of such knowledge, any such inference
has to be made based on the estimates $\hat{k}_N$ and $\hat{f}_N$. In this paper
we introduce a complexity-regularized quantization approach
to NLDR, assuming that the intrinsic dimension $k$ of the data
is given (e.g., as a maximum-likelihood estimate [3]).

II. Smooth manifolds and their noisy embeddings 

We begin with a quick sketch of some notions about smooth
manifolds [4]. A smooth manifold of dimension $k$ is a set $M$
together with a collection $\mathcal{A} = \{(U_l, \varphi_l) : l \in \Lambda\}$, where the
sets $U_l \subset M$ cover $M$ and each map $\varphi_l$ is a bijection of $U_l$
onto an open set $\varphi_l(U_l) \subset \mathbb{R}^k$, such that for all $l, l'$ with
$U_l \cap U_{l'} \neq \emptyset$ the map
$\varphi_{l'} \circ \varphi_l^{-1} : \varphi_l(U_l \cap U_{l'}) \to \varphi_{l'}(U_l \cap U_{l'})$
is smooth. The pairs $(U_l, \varphi_l)$ are called charts of $M$, and the
entire collection $\mathcal{A}$ is referred to as an atlas. Intuitively, the
charts describe the points of $M$ by local coordinates: given
$p \in M$ and a chart $(U_l \ni p, \varphi_l)$, $\varphi_l$ maps any point $q$ "near
$p$" (i.e., $q \in U_l$) to an element of $\varphi_l(U_l) \subset \mathbb{R}^k$. Smoothness
of the transition maps $\varphi_{l'} \circ \varphi_l^{-1}$ ensures that local coordinates
of a point transform differentiably under a change of chart.

Assuming that $M$ is compact, we can always choose the
atlas $\mathcal{A}$ in such a way that the indexing set $\Lambda$ is finite and
each $\varphi_l(U_l)$ is an open ball of radius $r_l$ [4, Thm. 3.3] (one
can always set $r_l = 1$ for all $l \in \Lambda$, but we choose not to do
this for greater flexibility in modeling).

The next notion we need is that of a tangent space to
$M$ at point $p$, denoted by $T_pM$. Let $I \subset \mathbb{R}$ be an open
interval such that $0 \in I$. Consider the set of all curves
$\xi : I \to M$ such that $\xi(0) = p$. Then for any chart $(U_l \ni p, \varphi_l)$
we have a function $\xi_l = \varphi_l \circ \xi : I \to \mathbb{R}^k$, such that
$\xi_l(t) \in \varphi_l(U_l)$ for all $t$ in a sufficiently small neighborhood
of 0. We say that two such curves $\xi, \xi'$ are equivalent iff

$$\left.\frac{d\xi_{l,j}(t)}{dt}\right|_{t=0} = \left.\frac{d\xi'_{l,j}(t)}{dt}\right|_{t=0}, \qquad j = 1, \ldots, k,$$

for all $l \in \Lambda$ such that $U_l \ni p$, where $\xi_{l,j}(t)$ are the
components of $\xi_l(t)$. The resulting set of equivalence classes has the
structure of a vector space of dimension $k$, and is precisely the tangent
space $T_pM$. Intuitively, $T_pM$ allows us to "linearize" $M$ around $p$.
Note that, although all the tangent spaces $T_pM$, $p \in M$, are
isomorphic to each other and to $\mathbb{R}^k$, there is no meaningful
way to add elements of $T_pM$ and $T_qM$ with $p, q$ distinct.

Next, we specify the class of stochastic embeddings dealt
with in this paper. Consider three random variables $L, Y, X$,
where $L$ takes values in the finite set $\Lambda$ with $w_l = \Pr(L = l)$,
$Y$ takes values in $\mathbb{R}^k$, and $X$ takes values in $\mathbb{R}^n$. Conditional
distributions of $Y$ given $L$ and of $X$ given $Y, L$ are assumed
to be absolutely continuous and described by densities $f_{Y|L}$
and $f_{X|YL}$, respectively. Since for a compact $M$ the images
$\varphi_l(U_l)$ of charts in $\mathcal{A}$ are open balls of radii $r_l$, let us suppose
that the conditional mean $m_l(Y) = E[Y | L = l]$ is the center
of $\varphi_l(U_l)$ [we can therefore take $m_l(Y) = 0$ for all $l \in \Lambda$]
and that the largest eigenvalue of the conditional covariance
matrix $K_l(Y) = E[YY^t | L = l]$ of $Y$ given $L = l$ is equal to
$r_l^2$. It is convenient to think of the eigenvectors $e_1^{(l)}, \ldots, e_k^{(l)}$ of
$K_l(Y)$ as giving a basis of the tangent space $T_{\varphi_l^{-1}(0)}M$. The
unconditional density $f_X$ of $X$ is the finite mixture

$$f_X(x) = \sum_{l \in \Lambda} w_l f_l(x), \qquad f_l(x) = \int_{\mathbb{R}^k} f_{X|YL}(x|y,l)\, f_{Y|L}(y|l)\, dy.$$

The resulting pdf follows the local structure of the manifold
$M$ and accounts both for low- and high-dimensional noise.

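As an example [5], let all $f_{Y|L}$ be $k$-dimensional zero-mean
Gaussians with unit covariance matrices,
$f_{Y|L}(y|l) = \mathcal{N}(y; 0, I) = (2\pi)^{-k/2} \exp(-\tfrac{1}{2} y^t y)$, and
$f_{X|YL}(x|y,l) = \mathcal{N}(x; \mu_l + A_l y, \Sigma_l)$, $\forall l \in \Lambda$, for some means
$\mu_l \in \mathbb{R}^n$, covariance matrices $\Sigma_l$, and $n \times k$ matrices $A_l$, so that
$f_X(x) = \sum_{l \in \Lambda} w_l\, \mathcal{N}(x; \mu_l, A_l A_l^t + \Sigma_l)$.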
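The Gaussian example can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sample_embedding` and all numerical values are hypothetical choices made only for illustration. It draws $(L, Y, X)$ from the model and compares the empirical covariance of one component with $A_l A_l^t + \Sigma_l$.

```python
import numpy as np

def sample_embedding(w, mus, As, Sigmas, N, rng):
    """Draw N samples: L ~ w, Y ~ N(0, I_k), X | Y, L ~ N(mu_L + A_L Y, Sigma_L)."""
    k = As[0].shape[1]
    n = mus[0].shape[0]
    labels = rng.choice(len(w), size=N, p=w)
    Y = rng.standard_normal((N, k))
    X = np.empty((N, n))
    for l in range(len(w)):
        idx = np.flatnonzero(labels == l)
        # High-dimensional noise with covariance Sigma_l.
        noise = rng.multivariate_normal(np.zeros(n), Sigmas[l], size=idx.size)
        X[idx] = mus[l] + Y[idx] @ As[l].T + noise
    return labels, Y, X

rng = np.random.default_rng(0)
n, k = 3, 1                                   # illustrative dimensions
w = np.array([0.5, 0.5])
mus = [np.zeros(n), np.array([4.0, 0.0, 0.0])]
As = [np.array([[1.0], [0.5], [0.0]]), np.array([[0.0], [1.0], [1.0]])]
Sigmas = [0.01 * np.eye(n), 0.01 * np.eye(n)]

labels, Y, X = sample_embedding(w, mus, As, Sigmas, 20000, rng)

# Empirical covariance of component 0 versus the model covariance A A^t + Sigma.
emp_cov = np.cov(X[labels == 0].T)
model_cov = As[0] @ As[0].T + Sigmas[0]
```

With a sample of this size the two matrices should agree up to sampling error, which is the content of the mixture formula for $f_X$.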

III. Complexity-regularized mixture models 

Consider a random vector $X \in \mathbb{R}^n$ with an absolutely
continuous distribution $P_f$, described by a pdf $f$. We wish to
find a mixture model that would not only yield a good "local"
approximation to $f$, but also have low complexity, where the
precise notion of complexity depends on application.

In order to set this up quantitatively, we use a
complexity-regularized adaptation of the quantizer mismatch approach of
Gray and Linder [6]. We seek a finite collection
$\Gamma = \{g_m : m \in \mathcal{M}\}$ of pdf's from a class $\mathcal{G}$ of "admissible" models
and a measurable partition $\mathcal{R} = \{R_m : m \in \mathcal{M}\}$ of $\mathbb{R}^n$ that
would minimize the objective function

$$I_f(\mathcal{R}, \Gamma) = \sum_{m \in \mathcal{M}} P_f(R_m) \left[ D(f_m \| g_m) + \mu \Phi_\Gamma(g_m) \right], \tag{1}$$

where $f_m$ is the pdf defined as $1_{\{x \in R_m\}} f(x) / P_f(R_m)$, $D(\cdot \| \cdot)$
is the relative entropy, $\Phi_\Gamma(g_m)$ is a regularization functional
that quantifies the complexity of the $m$th model pdf relative
to the entire collection $\Gamma$, and $\mu > 0$ is the parameter that
controls the trade-off between the relative-entropy (mismatch)
term and the complexity term.

This minimization problem can be posed as a complexity-constrained
quantization problem with an encoder $\alpha : \mathbb{R}^n \to \mathcal{M}$
corresponding to the partition $\mathcal{R} = \{R_m\}$ through $\alpha(x) = m$
if $x \in R_m$, a decoder $\beta : \mathcal{M} \to \mathcal{G}$ defined by $\beta(m) = g_m$,
and a length function $\ell : \mathcal{M} \to \{0, 1, 2, \ldots\}$ satisfying the
Kraft inequality

$$\sum_{m \in \mathcal{M}} e^{-\ell(m)} \le 1.$$

In order to describe the
encoder and to quantify the performance of the quantization
scheme, we need to choose a distortion measure between
an input vector and an encoder output in such a way that
minimizing average distortion would yield the $I$-functional (1)
of the corresponding partition and codebook.

Consider the distortion $\rho(x, m) = \ln(f(x)/g_m(x)) + \ell(m) + \mu \Phi_\Gamma(g_m)$
(this is not a distortion measure in the strict
sense since it can be negative, but its expectation with respect
to $f$ is nonnegative by the divergence inequality). For a given
codebook $\Gamma$ and length function $\ell$, the optimal encoder is the
minimum-distortion encoder $\alpha(x) = \arg\min_{m \in \mathcal{M}} \rho(x, m)$
with ties broken arbitrarily. The resulting partition $\mathcal{R} = \{R_m\}$
yields the average distortion

$$E_f\, \rho(X, \alpha(X)) = \sum_{m \in \mathcal{M}} p_m \left[ \ell(m) + \mu \Phi_\Gamma(g_m) + \int f_m(x) \ln \frac{p_m f_m(x)}{g_m(x)}\, dx \right],$$

where $p_m = P_f(R_m)$. Then

$$E_f\, \rho(X, \alpha(X)) = \sum_{m \in \mathcal{M}} p_m \left[ \ln \frac{p_m}{e^{-\ell(m)}} + D(f_m \| g_m) + \mu \Phi_\Gamma(g_m) \right] \ge \sum_{m \in \mathcal{M}} p_m \left[ D(f_m \| g_m) + \mu \Phi_\Gamma(g_m) \right],$$

with equality if and only if $\ell(m) = -\ln p_m$. Thus, the optimal
decoder and length function for a given partition are such
that the average $\rho$-distortion is precisely the $I$-functional. We
can therefore iterate the optimality properties of the encoder,
decoder and length function in a Lloyd-type descent algorithm;
this can only decrease average distortion and thus the
$I$-functional. Note that the $\ln f(x)$ term in $\rho(x, m)$ does not
affect the minimum-distortion encoder. Thus, as far as the
encoder is concerned, the distortion measure
$\rho_0(x, m) = -\ln g_m(x) + \ell(m) + \mu \Phi_\Gamma(g_m)$ is equivalent to $\rho$.
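The lower bound on the average distortion rests on the Kraft inequality together with the divergence inequality. A short derivation, writing $q_m = e^{-\ell(m)}$ and $c = \sum_m q_m \le 1$ (both symbols are introduced here only for illustration):

```latex
\begin{align*}
\sum_{m \in \mathcal{M}} p_m \ln \frac{p_m}{q_m}
  &= \sum_{m \in \mathcal{M}} p_m \ln \frac{p_m}{q_m / c} + \ln \frac{1}{c} \\
  &= D\bigl(p \,\big\|\, q/c\bigr) + \ln \frac{1}{c} \;\ge\; 0 .
\end{align*}
```

Both terms are nonnegative, and both vanish exactly when $c = 1$ and $q_m = p_m$ for all $m$, which is the equality condition $\ell(m) = -\ln p_m$.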

When the distribution of $X$ is unknown, we can take a
sufficiently large training sample $X^N = (X_1, \ldots, X_N)$ and
use a Lloyd descent algorithm to empirically design a mixture
model for the data:

1) Initialization: begin with an initial codebook
$\Gamma^{(0)} = \{g_m^{(0)} : m \in \mathcal{M}\} \subset \mathcal{G}$, where $\mathcal{G}$ is the class of admissible models,
and a length function $\ell^{(0)} : \mathcal{M} \to \{0, 1, 2, \ldots\}$. Set iteration
number $r = 1$, pick a convergence threshold $\epsilon$, and let $D_0$ be
the average $\rho_0$-distortion of the initial codebook.

2) Minimum-distortion encoder: encode each sample $X_i$ into
the index $\alpha^{(r)}(X_i) = \arg\min_{m \in \mathcal{M}} \rho_0(X_i, g_m^{(r-1)})$.

3) Centroid decoder: update the codebook by minimizing
over all $g \in \mathcal{G}$ the empirical conditional expectation

$$E[\rho_0(X, g) \,|\, \alpha^{(r)}(X) = m] = \frac{1}{N_m^{(r)}} \sum_{i : \alpha^{(r)}(X_i) = m} \rho_0(X_i, g),$$

where $N_m^{(r)} = |\{i : \alpha^{(r)}(X_i) = m\}|$, i.e., set
$\beta^{(r)}(m) = g_m^{(r)} = \arg\min_{g \in \mathcal{G}} E[\rho_0(X, g) \,|\, \alpha^{(r)}(X) = m]$.

4) Optimal length function: if $N_m^{(r)} > 0$, let $\ell^{(r)}(m) = -\ln p_m^{(r)}$,
where $p_m^{(r)} = N_m^{(r)}/N$. If $N_m^{(r)} = 0$, remove the
corresponding cell from the code and decrease $|\mathcal{M}|$ by 1.

5) Test: compute the average $\rho_0$-distortion $D_r$ with the code
$(\alpha^{(r)}, \beta^{(r)}, \ell^{(r)})$. If $(D_{r-1} - D_r)/D_{r-1} < \epsilon$, quit. Otherwise,
go to Step 2 and continue.

With a judicious choice of the initial codebook and
length function, this algorithm yields a finite mixture model
$\{(g_m, p_m) : m \in \mathcal{M}\}$ as a good "fit" to the empirical
distribution of the data in the sense of near-optimal trade-off
between the local mismatch and complexity.
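The five steps above can be sketched in code. This is a simplified illustration under several assumptions that are not in the text: the complexity term is switched off ($\mu = 0$), so the centroid of a cell is the maximum-likelihood Gaussian (sample mean and covariance); lengths are real-valued; cells with too few points for a full-rank covariance are dropped along with empty ones; a small ridge keeps covariances nonsingular; and the stopping test is applied to the encoder-side distortion with an absolute value in the denominator. The names `neg_log_gauss` and `lloyd_gauss_codebook` are hypothetical.

```python
import numpy as np

def neg_log_gauss(X, mean, cov):
    """-ln N(x; mean, cov) for each row x of X, without the (n/2) ln(2 pi) constant."""
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return 0.5 * (logdet + maha)

def lloyd_gauss_codebook(X, means, covs, eps=1e-4, max_iter=100, ridge=1e-6):
    """Lloyd descent for rho_0(x, m) = -ln g_m(x) + l(m), i.e. with mu = 0."""
    n = X.shape[1]
    means, covs = list(means), list(covs)
    lengths = np.full(len(means), np.log(len(means)))   # initial l(m) = ln |M|
    prev = np.inf
    for _ in range(max_iter):
        # Step 2: minimum-distortion encoder.
        rho = np.stack([neg_log_gauss(X, m, c) for m, c in zip(means, covs)],
                       axis=1) + lengths
        assign = rho.argmin(axis=1)
        cur = rho.min(axis=1).mean()
        # Step 5: stop when the relative decrease falls below eps.
        if np.isfinite(prev) and (prev - cur) <= eps * abs(prev):
            break
        prev = cur
        # Steps 3 and 4: centroid decoder and optimal length function.
        new_means, new_covs, probs = [], [], []
        for m in range(len(means)):
            cell = X[assign == m]
            if len(cell) <= n:       # assumption: drop empty or degenerate cells
                continue
            new_means.append(cell.mean(axis=0))
            new_covs.append(np.cov(cell.T, bias=True) + ridge * np.eye(n))
            probs.append(len(cell) / len(X))
        means, covs = new_means, new_covs
        lengths = -np.log(np.array(probs))
    weights = np.exp(-lengths)
    return means, covs, weights / weights.sum()

# Two well-separated clusters in the plane (synthetic data for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(10.0, 1.0, (300, 2))])
init_means = [np.array([1.0, 1.0]), np.array([8.0, 8.0])]
init_covs = [np.eye(2), np.eye(2)]
means, covs, weights = lloyd_gauss_codebook(X, init_means, init_covs)
```

Adding the geometric complexity term would change Step 2 (an extra per-cell penalty) and Step 3 (the centroids are no longer plain sample statistics); the sketch leaves both out.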

IV. Application to NLDR 

Given a training sample $X^N = (X_1, \ldots, X_N)$ of "raw"
$n$-dimensional data and assuming its intrinsic dimension $k < n$ is
known, our goal is to determine two mappings, $v : \mathbb{R}^n \to \mathbb{R}^k$
and $w : \mathbb{R}^k \to \mathbb{R}^n$, where $v$ maps high-dimensional vectors to
their dimension-reduced versions and $w$ maps back to the
high-dimensional space. In general, the dimension-reducing map $v$
entails loss of information, so $w(v(x)) \neq x$. Therefore we will
be interested in the average distortion incurred by our scheme,
$d(v, w) = E[d(X, w(v(X)))]$, where $d : \mathbb{R}^n \times \mathbb{R}^n \to [0, \infty)$
is a suitable distortion measure on pairs of $n$-vectors, e.g., the
squared Euclidean distance, and the expectation is w.r.t. the
empirical distribution of the sample.

A. Mixture model of a stochastic embedding 

The first step is to use the above quantization scheme to fit a
complexity-regularized Gaussian mixture model to the training
sample. Our class $\mathcal{G}$ of admissible model pdf's will be the set
of all $n$-dimensional Gaussians with nonsingular covariance
matrices, $\mathcal{G} = \{\mathcal{N}(x; \mu, K) : \mu \in \mathbb{R}^n, \det K > 0\}$, and
for each finite set $\Gamma \subset \mathcal{G}$ we shall define a regularization
functional $\Phi_\Gamma : \Gamma \to [0, \infty)$ that penalizes those $g \in \Gamma$ that
are "geometrically complex" relative to the rest of $\Gamma$.

The idea of "geometric complexity" can be motivated [5],
[7] by the example of the Gaussian mixture model from
Sect. II. The covariance matrix of the $l$th component, $A_l A_l^t + \Sigma_l$,
is invariant under the mapping $A_l \mapsto A_l R$, where $R$
is a $k \times k$ orthogonal matrix, i.e., $R R^t = I$. In geometric
terms, a copy of the orthogonal group $O_k$ associated with the
$l$th component of the mixture is the group of rotations and
reflections in the tangent space to $M$ at $\varphi_l^{-1}(0)$. Thus, the
log-likelihood term in $\rho_0$ is not affected by assigning arbitrary
and independent orientations to the tangent spaces associated
with the components of the mixture. However, since our goal
is to model the intrinsic global geometry of the data, it should
be possible to smoothly glue together the local data provided
by our model. We therefore require that the orientations of the
tangent spaces at "nearby" points change smoothly as well. (In
fact, one has to impose certain continuity requirements on the
orientation of the tangent spaces in order to define measure
and integration on the manifold [4, Ch. XI].)
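The invariance of the component covariance under $A_l \mapsto A_l R$ is easy to confirm numerically. A small sketch, assuming NumPy and randomly generated illustrative matrices (none of the values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2                                   # illustrative dimensions
A = rng.standard_normal((n, k))               # n x k factor-loading matrix
Sigma = 0.1 * np.eye(n)                       # high-dimensional noise covariance

# A random k x k orthogonal matrix, obtained from a QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((k, k)))

cov_before = A @ A.T + Sigma
cov_after = (A @ R) @ (A @ R).T + Sigma

# The two covariances coincide, so the likelihood cannot distinguish A from A R.
print(np.allclose(cov_before, cov_after))     # prints True
```

This is why the log-likelihood alone leaves the orientation of each local frame undetermined and an additional coupling between components is needed.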

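Given a finite set $\Gamma \subset \mathcal{G}$, we shall define the regularization
functional $\Phi_\Gamma : \Gamma \to [0, \infty)$ as

$$\Phi_\Gamma(g) = \sum_{g' \in \Gamma \setminus \{g\}} \kappa(\mu_g, \mu_{g'})\, D(g' \| g), \tag{2}$$

where $\kappa : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^+$ is a smooth positive symmetric
kernel such that $\kappa(x, x') \to 0$ as $\|x - x'\| \to \infty$, and

$$D(g' \| g) = \frac{1}{2} \left( \ln \det(K_{g'}^{-1} K_g) + \mathrm{Tr}(K_g^{-1} K_{g'}) + (\mu_g - \mu_{g'})^t K_g^{-1} (\mu_g - \mu_{g'}) - n \right)$$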

is the relative entropy between two Gaussians. Possible choices
for the kernel $\kappa$ are the inverse Euclidean distance
$\kappa(x, x') = \|x - x'\|^{-1}$ [8], a Gaussian kernel $\kappa(x, x') = \mathcal{N}(x - x'; 0, \sigma^2 I)$
for a suitable value of $\sigma$ [7], [8], or a compactly supported
"bump" $\kappa(x, x') = \psi_{r_1, r_2}(x - x')$, where $\psi_{r_1, r_2}$ is an infinitely
differentiable reflection-symmetric function that is identically
zero everywhere outside a closed ball of radius $r_2$ and one
everywhere inside an open ball of radius $r_1 < r_2$. The
relative entropy serves as a measure of position and orientation
alignment of the tangent spaces, while the smoothing kernel
ensures that more weight is assigned to "nearby" components.
This complexity functional is a generalization of the "global
coordination" prior of Brand [7] to mixtures with unequal
component weights.
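The regularization functional (2) can be evaluated directly from the component means and covariances. The sketch below assumes NumPy and the Gaussian-kernel choice for $\kappa$; the function names `gauss_kl`, `gaussian_kernel`, and `complexity`, and the default bandwidth, are hypothetical.

```python
import numpy as np

def gauss_kl(mu_p, K_p, mu_q, K_q):
    """Relative entropy D(p || q) for p = N(mu_p, K_p) and q = N(mu_q, K_q)."""
    n = mu_p.shape[0]
    Kq_inv = np.linalg.inv(K_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(K_p)
    _, logdet_q = np.linalg.slogdet(K_q)
    return 0.5 * (logdet_q - logdet_p + np.trace(Kq_inv @ K_p)
                  + diff @ Kq_inv @ diff - n)

def gaussian_kernel(x, y, sigma):
    """kappa(x, y) = N(x - y; 0, sigma^2 I)."""
    n = x.shape[0]
    d2 = float(np.sum((x - y) ** 2))
    return np.exp(-0.5 * d2 / sigma**2) / (2.0 * np.pi * sigma**2) ** (n / 2)

def complexity(m, means, covs, sigma=1.0):
    """Phi_Gamma(g_m) = sum over m' != m of kappa(mu_m, mu_m') D(g_m' || g_m)."""
    total = 0.0
    for j in range(len(means)):
        if j == m:
            continue
        kappa = gaussian_kernel(means[m], means[j], sigma)
        total += kappa * gauss_kl(means[j], covs[j], means[m], covs[m])
    return total

# Two unit-covariance components one unit apart in the plane.
means = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
phi_0 = complexity(0, means, covs, sigma=1.0)
```

In this example $D(g_1 \| g_0) = 1/2$ and $\kappa = e^{-1/2}/(2\pi)$, so `phi_0` equals their product; components that are far apart contribute almost nothing because the kernel decays.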

With these definitions of $\mathcal{G}$ and $\Phi_\Gamma$, the $\rho_0$-distortion for a
codebook $\Gamma = \{g_m : m \in \mathcal{M}\}$ and a length function $\ell$ is

$$\rho_0(x, m) = \frac{1}{2} \ln \det K_m + \frac{1}{2} (x - \mu_m)^t K_m^{-1} (x - \mu_m) + \ell(m) + \mu \sum_{m' \in \mathcal{M} \setminus \{m\}} \kappa(\mu_m, \mu_{m'})\, D(g_{m'} \| g_m),$$

where we have also removed the $(n/2) \ln(2\pi)$ term as it does
not affect the encoder. The effect of the geometric complexity
term is to curve the boundaries of the partition cells according
to locally interpolated "nonlocal information" about the rest of
the codebook. Determining the Lloyd centroids for the decoder
will involve solving $|\mathcal{M}|$ simultaneous nonlinear equations
for the means and the same number of equations for the
covariance matrices. For computational efficiency we can use
the kernel data from the previous iteration, which would
sacrifice optimality but avoid nonlinear equations.

B. Design of reduction and reconstruction maps 

The output of the previous step is a Gauss mixture model
$\{(g_m, p_m) : m \in \mathcal{M}\}$ and a partition $\mathcal{R} = \{R_m\}$ of $\mathbb{R}^n$.
Suppose that for each $m \in \mathcal{M}$ the eigenvectors $e_1^{(m)}, \ldots, e_n^{(m)}$
of $K_m$ are numbered in the order of decreasing eigenvalues,
$\lambda_1^{(m)} \ge \ldots \ge \lambda_n^{(m)}$. The next step is to design the
dimension-reducing map $v$ and the reconstruction map $w$. One method,
proposed by Brand [7], is to use the mixture model of the
underlying pdf [obtained in his case by an EM algorithm with
a prior corresponding to the average of the complexity $\Phi_\Gamma(g)$
over the entire codebook and with equiprobable components of
the mixture] to construct a mixture of local affine transforms,
preceded by local Karhunen-Loeve transforms, as a solution
to a weighted least-squares problem.

However, we can use the encoder partition $\mathcal{R}$ directly:
for each $m \in \mathcal{M}$, let $v_m(x) = \Pi_m(x - \mu_m)$, where $\Pi_m$
is the projection onto the first $k$ eigenvectors of $K_m$, and
then define $v(x) = \sum_{m \in \mathcal{M}} 1_{\{x \in R_m\}} v_m(x)$. This approach
is similar to local principal component analysis of Kambhatla
and Leen [9], except that their quantizer was not
complexity-regularized and therefore the shape of the resulting Voronoi
regions was determined only by local statistical data. We
can describe the operation of dimension reduction (feature
extraction) as an encoder $v : \mathbb{R}^n \to \mathcal{M} \times \mathbb{R}^k$, so that
$v(x) = (\alpha(x), v_{\alpha(x)}(x))$, where $\alpha$ is the minimum-distortion
encoder for the $\rho_0$-distortion.

The corresponding reconstruction operation can be designed
as a decoder $w : \mathcal{M} \times \mathbb{R}^k \to \mathbb{R}^n$ which receives a
pair $(m, u)$, $m \in \mathcal{M}$, $u \in \mathbb{R}^k$, and computes
$w_m(u) = \mu_m + \sum_{i=1}^{k} \langle u, e_i^{(m)} \rangle e_i^{(m)}$, where $\langle \cdot, \cdot \rangle$ denotes the usual scalar
product in $\mathbb{R}^k$.
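The reduction and reconstruction maps for a single mixture component can be sketched as follows. This assumes NumPy and one particular reading of the formulas above: the reduced vector $u$ holds the coordinates of $x - \mu_m$ along the first $k$ eigenvectors, and the cell index $m$ is supplied by the encoder, which is not implemented here. The helper names are hypothetical.

```python
import numpy as np

def local_frame(K, k):
    """Columns are the k eigenvectors of K with the largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(K)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return eigvecs[:, order]

def reduce_point(x, mean, frame):
    """v_m(x): coordinates of x - mu_m along e_1^(m), ..., e_k^(m)."""
    return frame.T @ (x - mean)

def reconstruct_point(u, mean, frame):
    """w_m(u) = mu_m + sum_i u_i e_i^(m)."""
    return mean + frame @ u

# One component with an almost flat third direction (illustrative values).
mean = np.array([1.0, 2.0, 3.0])
K = np.diag([4.0, 1.0, 0.01])
frame = local_frame(K, k=2)

x = mean + np.array([2.0, -1.0, 0.5])
u = reduce_point(x, mean, frame)              # 2-dimensional feature vector
x_hat = reconstruct_point(u, mean, frame)     # back in 3 dimensions
sq_error = float(np.sum((x - x_hat) ** 2))    # energy left in the discarded direction
```

The reconstruction error is exactly the squared component of $x - \mu_m$ along the discarded eigenvectors, which is what makes $w(v(x)) \neq x$ in general.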

This encoder-decoder pair is a composite Karhunen-Loeve
transform coder matched to the mixture source
$g = \sum_m p_m g_m$. If the data alphabet $\mathcal{X}$ is compact, then the
squared-error distortion is bounded by some $A > 0$, and
the mismatch due to using this composite coder on the
disjoint mixture source $f = \sum_m p_m f_m$ can be bounded
from above by $A \|f - g\|_1$, where $\| \cdot \|_1$ is the $L_1$ norm.
Provided that the mixture $g$ is optimal for $f$ in the sense
of minimizing the $\rho$-distortion, we can use Pinsker's
inequality [10, Ch. 5] $\|f - g\|_1 \le \sqrt{2 D(f \| g)}$ and convexity
of the relative entropy to further bound the mismatch by

$$A \sqrt{2 \left( I_f(\mathcal{R}, \Gamma) - \mu \sum_m p_m \Phi_\Gamma(g_m) \right)}.$$
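The chain of inequalities behind this bound can be written out as follows; this is a sketch that uses only the definitions of $f_m$, $p_m$, and the $I$-functional (1), with "mismatch" standing for the excess distortion discussed above:

```latex
\begin{align*}
\text{mismatch}
  &\le A\,\|f - g\|_1
   \le A \sqrt{2 D(f \| g)}, \\
D(f \| g)
  &= D\Bigl(\sum_m p_m f_m \,\Big\|\, \sum_m p_m g_m\Bigr)
   \le \sum_m p_m D(f_m \| g_m)
   = I_f(\mathcal{R}, \Gamma) - \mu \sum_m p_m \Phi_\Gamma(g_m).
\end{align*}
```

The middle inequality in the second line is the joint convexity of the relative entropy, and the last equality is a rearrangement of (1).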

Note that the maps v and w are not smooth, unlike the 
analogous maps of Brand [7], [8]. This is an artifact of the hard 
partitioning used in our scheme. However, hard partitioning 
has certain advantages: it allows for use of composite codes 
[6] and nonlinear interpolative vector quantization [11] if 
additional compression of dimension-reduced data is required. 
Moreover, the lack of smoothness is not a problem in our 
case because we can use kernel interpolation techniques to 
model the geometry of dimension-reduced data by a smooth 
manifold, as explained next. 

C. Manifold structure of dimension-reduced data

Our use of mixture models has been motivated by certain
assumptions about the structure of stochastic embeddings of
low-dimensional manifolds into high-dimensional spaces. In
particular, given an $n$-dimensional Gaussian mixture model
$\{(g_m, p_m) : m \in \mathcal{M}\}$, we can associate to each component
of the mixture a chart of the underlying manifold, such that
the image of the chart in $\mathbb{R}^k$ is an open ball of radius
$r_m = (\lambda_1^{(m)})^{1/2}$ centered at the origin, and we can take the first
$k$ eigenvectors of the covariance matrix of $g_m$ as coordinate
axes in the tangent space to the manifold at the inverse
image of $0 \in \mathbb{R}^k$ under the $m$th chart. Owing to geometric
complexity regularization, the orientations of tangent spaces
change smoothly as a function of position.

Ideally, one would like to construct a smooth manifold
consistent with the given descriptions of charts and tangent
spaces. However, this is a fairly difficult task since we not only
have to define a smooth coordinate map $\varphi_m$ for each chart, but
also make sure that these maps satisfy the chart compatibility
condition. Instead, we can construct the manifold implicitly
by gluing the coordinate frames of the tangent spaces into an
object having a smooth inner product.

Specifically, let us fix a sufficiently small $\delta > 0$, and let
$\psi_m$ be an infinitely differentiable function that is identically
zero everywhere outside a closed ball of radius $r_m$ and one
everywhere inside an open ball of radius $r_m - \delta$, with both
balls centered at $\Pi_m \mu_m$. Let
$\eta_m(u) = \psi_m(u) / \sum_{m' \in \mathcal{M}} \psi_{m'}(u)$. The
inner product of two vectors $u, u' \in \mathbb{R}^k$, treated as elements
of the tangent space $T_{\varphi_m^{-1}(0)}M$, is given by
$\langle u, u' \rangle_m = \sum_{i=1}^{k} \langle u, e_i^{(m)} \rangle \langle e_i^{(m)}, u' \rangle$.
Then for each $y \in \mathbb{R}^k$ the map

$$g_y : \mathbb{R}^k \times \mathbb{R}^k \to [0, \infty), \qquad g_y(u, u') = \sum_{m \in \mathcal{M}} \eta_m(y + \Pi_m \mu_m)\, \langle u, u' \rangle_m,$$

is a symmetric form, which is positive definite whenever
$\eta_m(y + \Pi_m \mu_m) \neq 0$ for at least one value of $m$. In addition,
the map $y \mapsto g_y(\cdot, \cdot)$ is smooth. In this way, we have implicitly
defined a Riemannian metric [4, Ch. VII] on the underlying
manifold. The functions $\eta_m$ form a so-called smooth partition
of unity, which is the only known way of gluing together local
geometric data to form smooth objects [4, Ch. II].
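The cutoff functions and the resulting weights can be illustrated with a standard smooth-step construction. This is one possible choice of bump function, not necessarily the one intended by the text, and the names `smooth_step`, `bump`, and `partition_weights` are hypothetical. The sketch uses only the Python standard library and returns `None` where every cutoff vanishes, since the normalized weights are undefined there.

```python
import math

def smooth_step(t):
    """Infinitely differentiable step: 0 for t <= 0, 1 for t >= 1."""
    def s(z):
        return math.exp(-1.0 / z) if z > 0 else 0.0
    # The denominator is positive for every t, since t > 0 or 1 - t > 0.
    return s(t) / (s(t) + s(1.0 - t))

def bump(u, center, r, delta):
    """psi_m(u): one inside radius r - delta, zero outside radius r."""
    dist = math.dist(u, center)
    return smooth_step((r - dist) / delta)

def partition_weights(u, centers, radii, delta):
    """eta_m(u) = psi_m(u) / sum_m' psi_m'(u), or None outside every ball."""
    psi = [bump(u, c, r, delta) for c, r in zip(centers, radii)]
    total = sum(psi)
    if total == 0.0:
        return None
    return [p / total for p in psi]

# Two overlapping unit balls on the line (illustrative values).
centers = [(0.0,), (1.5,)]
radii = [1.0, 1.0]
inside_first = partition_weights((0.0,), centers, radii, 0.2)
in_overlap = partition_weights((0.75,), centers, radii, 0.2)
outside_all = partition_weights((5.0,), centers, radii, 0.2)
```

Wherever the weights are defined they are nonnegative and sum to one, so a weighted combination of the local inner products is again an inner product.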

In geometric terms, $\eta_m(y + \Pi_m \mu_m) = 0$ for all $m$ if and
only if $y \in \mathbb{R}^k$ is an image under the dimension-reduction
map of a point in $\mathbb{R}^n$ whose first $k$ principal components
w.r.t. each Gaussian in the mixture model fall outside the
covariance ellipsoid of that Gaussian. If the mixture model is
close to optimum, this will happen with negligible probability.
A practical advantage of this feature of our scheme is in
rendering it robust to outliers.

V. Consistency and codebook design 

Our mixture modeling scheme can also be used to estimate
the "true" but unknown pdf $f^*$ of the high-dimensional data,
if we assume that $f^*$ belongs to some fixed class $\mathcal{F}$. Indeed,
the empirically designed codebook $\Gamma = \{g_m : m \in \mathcal{M}\}$ of
Gaussian pdf's, the corresponding component weights $\{p_m\}$,
and the mixture $g = \sum_{m \in \mathcal{M}} p_m g_m$ are random variables since
they depend on the training sample $X^N$. We are interested in
the quality of approximation of $f^*$ by the mixture $g = g(X^N)$.

Following Moulin and Liu [12], we use the relative-entropy
loss function $D(f^* \| g)$. We shall give an upper bound on the
loss in terms of the index of resolvability [12]

$$R_{\mu,N}(f^*) = \min_{m \in \mathcal{M}} \left[ D(f^* \| g_m) + \frac{\mu L(g_m)}{N} \right],$$

where $L(g_m) = \Phi_\Gamma(g_m) - \ln p_m$, which quantifies how well
$f^*$ can be approximated, in the relative-entropy sense (and, by
Pinsker's inequality, in $L_1$ sense), by a Gaussian of moderate
geometric complexity relative to the rest of the codebook. We
have the following result:

Theorem V.1: Let the codebook $\Gamma = \{g_m : m \in \mathcal{M}\}$ of
Gaussian pdf's be such that the log-likelihood ratios
$U_m = -\ln(f^*(X)/g_m(X))$ uniformly satisfy the Bernstein moment
condition [10], i.e., there exists some $h > 0$ such that
$E|U_m - EU_m|^k \le (1/2) \mathrm{Var}(U_m)\, k!\, h^{k-2}$ for all $k \ge 2$. Let $M(f^*)$ be
the smallest number such that $\mathrm{Var}(U_m) \le -M(f^*)\, EU_m$ for
all $m \in \mathcal{M}$ (owing to the Bernstein condition, it is nonnegative
and finite). Then, for any $\mu > h + M(f^*)/2$ and $\delta > 0$,

$$\Pr\left[ D(f^* \| g) \le \frac{1+\alpha}{1-\alpha} R_{\mu,N}(f^*) + \frac{2\mu \ln(|\mathcal{M}|/\delta)}{(1-\alpha)N} \right] \ge 1 - 2\delta, \tag{3}$$

where $\alpha =$ [...]. The expected loss satisfies

$$E[D(f^* \| g)] \le \frac{1+\alpha}{1-\alpha} R_{\mu,N}(f^*) + \frac{4\mu |\mathcal{M}|}{(1-\alpha)N}. \tag{4}$$

The probabilities and expectations are all w.r.t. the pdf $f^*$.

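Proof: Since $\Phi_\Gamma(g_m) \ge 0$ for all $m \in \mathcal{M}$,
the composite complexity $L(g_m)$ satisfies the Kraft inequality:
$\sum_m e^{-L(g_m)} = \sum_m p_m e^{-\Phi_\Gamma(g_m)} \le \sum_m p_m = 1$.
Then we can use a strategy similar to that of Moulin and Liu
[12] to prove that

$$\Pr\left[ D(f^* \| g_m) > \frac{1+\alpha}{1-\alpha} R_{\mu,N}(f^*) + \frac{2\mu \ln(|\mathcal{M}|/\delta)}{(1-\alpha)N} \right] \le \frac{2\delta}{|\mathcal{M}|}$$

for each $m \in \mathcal{M}$. Hence, by the union bound,

$$D(f^* \| g_m) \le \frac{1+\alpha}{1-\alpha} R_{\mu,N}(f^*) + \frac{2\mu \ln(|\mathcal{M}|/\delta)}{(1-\alpha)N}$$

for all $m \in \mathcal{M}$, except for an event of probability at most $2\delta$.
By convexity of the relative entropy, $D(f^* \| g_m) \le C$ for all
$m \in \mathcal{M}$ implies that $D(f^* \| g) \le C$ for $g = \sum_{m \in \mathcal{M}} p_m g_m$.
Therefore

$$D(f^* \| g) \le \frac{1+\alpha}{1-\alpha} R_{\mu,N}(f^*) + \frac{2\mu \ln(|\mathcal{M}|/\delta)}{(1-\alpha)N}$$

with probability at least $1 - 2\delta$. To prove (4), we use the
fact [10] that if $Z$ is a random variable with $E|Z| < \infty$,
then $E[Z] \le \int_0^\infty \Pr[Z \ge t]\, dt$. We let
$Z = D(f^* \| g) - \frac{1+\alpha}{1-\alpha} R_{\mu,N}(f^*)$
and choose $\delta = |\mathcal{M}|\, e^{-Nt(1-\alpha)/(2\mu)}$. Then
$E[Z] \le \frac{4\mu |\mathcal{M}|}{(1-\alpha)N}$, which proves (4). ■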
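The integral behind the last step of the proof can be written out explicitly. This sketch assumes the tail bound $\Pr[Z \ge t] \le 2\delta$ with $\delta = |\mathcal{M}|\, e^{-Nt(1-\alpha)/(2\mu)}$, i.e., the high-probability bound solved for $\delta$ as a function of the excess $t$:

```latex
\begin{align*}
E[Z]
  &\le \int_0^\infty \Pr[Z \ge t]\, dt
   \le \int_0^\infty 2 |\mathcal{M}|\, e^{-N (1-\alpha) t / (2\mu)}\, dt \\
  &= 2 |\mathcal{M}| \cdot \frac{2\mu}{(1-\alpha) N}
   = \frac{4 \mu |\mathcal{M}|}{(1-\alpha) N}.
\end{align*}
```

The exponential bound may exceed one for small $t$, but it remains a valid upper bound on the probability, so the integral still dominates $E[Z]$.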
To discuss consistency in the large-sample limit, consider
a sequence of empirically designed mixture models
$\{(g_m^{(N)}, p_m^{(N)}) : m \in \mathcal{M}^{(N)}\}$. This is different from the usual
empirical quantizer design, where we increase the training set
size but keep the number of quantizer levels fixed. The scheme
is consistent in the relative-entropy sense if $E\, D(f^* \| g^{(N)}) \to 0$
as $N \to \infty$, where $g^{(N)} = \sum_{m \in \mathcal{M}^{(N)}} p_m^{(N)} g_m^{(N)}$ and the
expectation is with respect to $f^*$.

A sufficient condition for consistency can be determined
by inspection of the upper bound in Eq. (4). Specifically,
we require that the codebooks $\Gamma^{(N)}$ satisfy: (a)
$\max_{m \in \mathcal{M}^{(N)}} L(g_m^{(N)}) = o(N)$, (b) $\min_{m \in \mathcal{M}^{(N)}} D(f^* \| g_m^{(N)}) = o(1)$
for all $f^* \in \mathcal{F}$, and (c) $|\mathcal{M}^{(N)}| = o(N)$. Condition
(c) can be satisfied by initializing the Lloyd algorithm by a
codebook of size much smaller than the training set size $N$,
which is usually done in practice in order to ensure good
training performance. The first two conditions can also be
easily met in many practical settings.

Consider, for instance, the class $\mathcal{F}$ of all pdf's supported on
a compact $\mathcal{X} \subset \mathbb{R}^n$ and Lipschitz-continuous with Lipschitz
constant $c$. Then, if we take as our class of admissible
Gaussians $\mathcal{G} = \{\mathcal{N}(x; \mu, K) : \mu \in \mathcal{X},\ c_1 < \det K < c_2\}$
for suitably chosen constants $c_1, c_2 > 0$ independent of $N$,
the relative entropy $D(g \| g')$ of any two $g, g' \in \mathcal{G}$ can be
bounded independently of $N$, and condition (a) will be met
with proper choice of the component weights. Condition (b) is
likewise easy to meet since the maximum value of any $f^* \in \mathcal{F}$
depends only on the set $\mathcal{X}$, the Lipschitz constant $c$, and the
dimension $n$.

In general, the issue of optimal codebook design is closely
related to the problem of universal vector quantization [13]:
we can consider, e.g., a class $\mathcal{F}$ of pdf's with disjoint supports
contained in a compact $\mathcal{X} \subset \mathbb{R}^n$. Then a sequence of Gaussian
codebooks that yields a consistent estimate of each $f^* \in \mathcal{F}$ in
the large-sample limit is weakly minimax universal [13] for
$\mathcal{F}$ and can also be used to quantize any source contained in
the $L_1$-closed convex hull of $\mathcal{F}$.

VI. Discussion 

We have introduced a complexity-regularized quantization 
approach to NLDR. One advantage of this scheme over 
existing methods for NLDR based on Gaussian mixtures, e.g., 
[7], is that, instead of fitting a Gauss mixture to the entire 
sample, we design a codebook of Gaussians that provides a 
good trade-off between local adaptation to the data and global 
geometric coherence, which is key to robust geometric modeling.
Complexity regularization is based on a kernel smoothing
technique that allows for a meaningful geometric description 
of dimension-reduced data by means of a Riemannian metric 
and is also robust to outliers. Moreover, to our knowledge, 
the consistency proof presented here is the first theoretical 
asymptotic consistency result applied to NLDR. 

Work is currently underway to implement the proposed 
scheme for applications to image processing and computer 
vision. Also planned is future work on a quantization-based 
approach to estimating the intrinsic dimension of the data and 
on assessing asymptotic geometric consistency of our scheme 
in terms of the Gromov-Hausdorff distance between compact 
metric spaces [14]. 